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Existing optimal estimators of nonequilibrium path-ensemble averages are shown to fall within the 
framework of extended bridge sampling. Using this framework, we derive a general minimal- variance 
estimator that can combine nonequilibrium trajectory data sampled from multiple path-ensembles 
to estimate arbitrary functions of nonequilibrium expectations. The framework is also applied to 
obtaining asymptotic variance estimates, which are a useful measure of statistical uncertainty. In 
particular, we develop asymptotic variance estimates pertaining to Jarzynski's equality for free en- 
ergies and the Hummer-Szabo expressions for the potential of mean force, calculated from uni- or 
bidirectional path samples. Lastly, they are demonstrated on a model single-molecule pulling exper- 
iment. In these simulations, the asymptotic variance expression is found to accurately characterize 
the confidence intervals around estimators when the bias is small. Hence, it does not work well 
for unidirectional estimates with large bias, but for this model it largely reflects the true error in a 
bidirectional estimator derived by Minh and Adib. 
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I. INTRODUCTION 

Path-ensemble averages play a central role in nonequi- 
librium statistical mechanics, akin to the role of con- 
figurational ensemble averages in equilibrium statisti- 
cal mechanics. Expectations of various functionals 
over processes where a system is driven out of equi- 
librium by a time-dependent external potential have 
been shown to be related to equilibrium properties, in- 
cluding free energy differences 1,2 and thermodynamic 
expectations. 3,4 The latter relationship, between equilib- 
rium and nonequilibrium expectations, has been applied 
to several specific cases, such as: the potential of mean 
force (PMF) along the pulling coordinate 5,6,7 (or other 
observed coordinates 8 ) in single-molecule pulling experi- 
ments; RNA folding free energies as a function of a con- 
trol parameter; 9 the root mean square deviation from a 
reference structure; 10 the potential energy distribution 10 
and average; 11 and the thermodynamic length. 12 

Compared to equilibrium sampling, nonequilibrium 
processes may be advantageous for traversing energetic 
barriers and accessing larger regions of phase space per 
unit time. This is useful, for example, in reducing the ef- 
fects of experimental apparatus drift or increasing the 
sampling of barrier-crossing events. Thus, there has 
been interest in calculating equilibrium properties from 
nonequilibrium trajectories collected in simulations or 
laboratory experiments. Indeed, single-molecule pulling 
data has been used to experimentally verify relationships 
between equilibrium and nonequilibrium quantities. 13,1 

While many estimators for free energy 
differences 3,15,16,17 and equilibrium ensemble averages 
can be constructed from nonequilibrium relationships, 
they will differ in the efficiency with which they utilize 
finite data sets, leading to varying amounts of statistical 



bias and uncertainty. Characterization of this bias 
and uncertainty is helpful for comparing the quality of 
different estimators 18 and assessing the accuracy of a 
particular estimate. The statistical uncertainty of an 
estimator is usually quantified by its variance in the 
asymptotic, or large sample, limit, where estimates 
from independent repetitions of the experiment often 
approach a normal distribution about the true value 
due to the central limit theorem. It is an important 
goal to find an optimal estimator which minimizes this 
asymptotic variance. 

Although numerical estimates of the asymptotic vari- 
ance may be provided by bootstrapping (e.g. Ref. 19 ), 
closed-form expressions can provide computational ad- 
vantages in the computation of confidence intervals, al- 
low comparison of asymptotic efficiency, 18,20 and facili- 
tate the design of adaptive sampling strategies to target 
data collection in a manner that most rapidly reduces 
statistical error. 21,22,23 In the asymptotic limit, the sta- 
tistical error in functions of the estimated parameters can 
be estimated by propagating this variance estimate via a 
first-order Taylor series expansion. While this procedure 
is relatively straightforward for simple estimators, it can 
be difficult for estimators that involve arbitrary functions 
(e.g. nonlinear or implicit equations) of nonequilibrium 
path-ensemble averages. 

Fortunately, the extended bridge sampling (EBS) 
estimators, 20,24,25,26 a class of equations for estimating 
the ratios of normalizing constants, are known to have 
both minimal-variance forms and associated asymptotic 
variance expressions. Recently, Shirts and Chodera 27 ap- 
plied the EBS formalism to generalize the Bennett ac- 
ceptance ratio, 15 producing an optimal estimator com- 
bining data from multiple equilibrium states to com- 
pute free energy differences, thermodynamic expecta- 
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tions, and their associated uncertainties. Here, we ap- 
ply the EBS formalism to estimators utilizing nonequilib- 
rium trajectories. We first construct a general minimal- 
variance path-average estimator that can use samples 
collected from multiple noncquilibrium path-ensembles. 
We then show that some existing path-average estima- 
tors using uni- and bidirectional data are special cases 
of this general estimator, proving their optimality. This 
also allows us to develop asymptotic variance expressions 
for estimators based on Jarzynski's equality 1,2 and the 
Hummer-Szabo expressions for the PMF. 5,6,28 We then 
demonstrate them on simulation data from a simple one- 
dimensional system and comment on their applicability. 



II. EXTENDED BRIDGE SAMPLING 

Suppose that we sample Ni paths (trajectories) from 
each of K path-ensembles indexed by i = 1, 2, K. The 
path-ensemble average of an arbitrary functional J~[X] in 
path-ensemble i is defined by 

= J dXF[X] Pi [X], (1) 

where is a probability density over trajectories, 

Pi [X] =c~ 1 q l [X] ■ c t = J dX qi [X], (2) 

with unnormalizcd density qi[X] > and the normaliza- 
tion constant Ci (a path partition function). The above 
integrals, in which dX is an infinitesimal path element, 
are taken over all possible paths, X. Extended bridge 
sampling estimators provide a way of estimating ratios 
of normalization constants Ci/cj, which will prove useful 
in estimating free energies and thermodynamic expecta- 
tions. 

To construct these estimators, we first note the impor- 
tance sampling identity, 



("»?)>, = 



dX qi [X] 



dX qj [X] 
= c j (<Xij Qi)j > 



JdXa t] [X}q t [X} qj [X} 

J dX qi [X] 
JdXaijWqiWqjlX} 



JdX qj [X] 



(3) 



where j is another path-ensemble index, ctij [X] is an ar- 
bitrary functional of X, and all normalization constants 
are nonzero. 

Summing over the index j in Eq. 3 and using the sam- 
ple mean, iV" 1 En=i J~[Xin], as an estimator for (J 7 ) i , 
we obtain a set of K estimating equations, 



K „ Ni 
Ci 



K 



N, 



path, Xi n , is indexed by the ensemble i from which it is 
sampled, and the sample number n = l,2,...,TVj. This 
coupled set of nonlinear equations defines a family of 
estimators parameterized by the choice of [X] , all of 
which are asymptotically consistent, but whose statistical 
efficiencies will vary. 20 
With the choice, 



otij[X] = — , 

k=l 

Eq. 4 simplifies to the optimal EBS estimator, 

-l 



(5) 
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Ck q%[Xjn] 
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This choice for onj [X] is optimal in that the asymptotic 
variance of the ratios Ci/cj is minimal. 20,2 These equa- 
tions may be solved by any appropriate algorithm, in- 
cluding a number of efficient and stable methods sug- 
gested by Shirts and Chodera. 27 

The asymptotic covariance of Eq. 6 is estimated by, 



= M T (I N - MNM T ) + M 



(7) 



where the elements of are the covariances of the log- 
arithms of the estimated normalization constants, Qij = 
cov(7i,7j), and 7, = lnc^. 26 The superscript (...) + de- 
notes an appropriate generalized inverse, such as the 
Moore-Penrose pseudoinverse, In is the TV x TV identity 
matrix (where TV = E;=i 1S the total number of sam- 
ples), TV = diag (Ni, TV2, Nk) is the diagonal matrix 
of sample sizes, and M is the TV x K weight matrix with 
elements, 



qdXr, 



K 



(8) 



E N k c^q k [X n ] 
k=l 



In this matrix, the distribution from which samples are 
drawn from is irrelevant, and X is only indexed by 
n = 1, . . . , TV. We note that the sum over each column, 

E„=i M m, is one. 

For arbitrary functions of the logarithms of the nor- 
malization constants, c/>(tl, ■ ■■,jk) and ^(71, ..., , jk)i the 
asymptotic covariance cov(</>, ip) can be estimated from 
according to, 



K 



cov 



? ^ d(j) - dtp 
1,1=1 IJ 



(9) 



E E a v & = E E a v * 

j=l 1 n=l 



],(4) 



n=l 



whose solutions yield estimates Cj for the normalization 
constants Cj, up to an irrelevant scalar multiple. Each 



through first-order Taylor series expansion of 4> and ip. 

III. GENERAL PATH-ENSEMBLE AVERAGES 

Following previous work, 27,29 we estimate noncquilib- 
rium expectations by defining additional path-ensembles 
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with "unnormalizcd densities" 

qr i [X]=F[X]q i [X] ; c Fi = J dXq^[X\. (10) 

Using Eqs. (1), (2), and (10), we can express nonequilib- 
rium expectations as a ratio of the appropriate normal- 
ization constants, (J 7 ) i = c^Jcj. Notably, this can be 
estimated without actually sampling path-ensembles bi- 
ased by some function of J-[X] (although it is sometimes 
possible to do so in computer simulations ' via transi- 
tion path sampling 32,33 ). If no paths are drawn from the 
path-ensemble corresponding to q Fi [X], then Njr. = 
and it is no longer required that q Fi [X] > 0. 20,27 

For each defined path-ensemble, the weight matrix M 
is augmented by one column with elements, 



71 r a -i fflftW 
M nFi = c T . — . 

E N k c^q k [X n ] 

k=l 



(11) 



The estimator for the path-ensemble average, Ti « (F) it 
can be expressed in terms of weight matrix elements, 



N 



J~i — J]] Mni F[X n ] , 



(12) 



and its uncertainty estimated by 

a\Ti) « 7?(Q rtri -2® Fii + Q ii ). (13) 



IV. EXPERIMENTALLY RELEVANT 
PATH-ENSEMBLES 

The above formalism is fully general, and may be ap- 
plied to any situation where the ratio gj[X]/gj[X] can be 
computed. For arbitrary path-ensembles, unfortunately, 
calculating this ratio is only possible in computer sim- 
ulations unless certain assumptions are made about the 
dynamics. 3 In a few special path-ensembles, however, 
we can use the Crooks fluctuation theorem 35,36 to esti- 
mate this ratio, allowing us to apply the EBS estimator 
to laboratory experiments. We examine these here. 

First, consider a forward process, in which a system, 
initially in equilibrium, is propagated under some time- 
dependent dynamics for a time r which may cause it to 
be driven out of equilibrium. The time-dependence of 
the evolution law (e.g. Hamiltonian dynamics in a time- 
dependent potential) is the same for all paths sampled 
from this ensemble. 

For a sample of paths only drawn from this ensemble, 
the optimal EBS estimator of (F)f reduces to the sample 
mean estimator, which we call the unidirectional path- 
ensemble average estimator 



1 



N f 



(14) 



and the associated asymptotic variance from Eq. 9 re- 
duces to the variance of the sample mean (see Appendix 
A) 



1 



N f 



±- £ (T[X fn ] T 
j „ = i 



f) 



(15) 



The forward process has a unique counterpart known 
as the reverse process. Here, the system moves via the 
opposite protocol in thermodynamic state space; after 
initial configurations are drawn from the final thermo- 
dynamic state of the forward path-ensemble, they are 
driven towards the initial state. If the dynamical law 
satisfies detailed balance when the control parameters are 
held constant at each fixed time t, the path probabilities 
in the conjugate forward and reverse path-ensembles are 
related according to the Crooks fluctuation theorem: 35 ' 36 



p f [X] _ q f [X] c r 



p r [X] q r [X] cf 



= e w T [x]-Af T = e n[x] 



(16) 



in which X is the time-reversal, or conjugate twin, 37 of 
X, Aft = — m(ct/co) is the dimensionless free energy 
difference between thermodynamic states at times and 
t (with t being the fixed total trajectory length) and 
u>t[X] is the appropriate dimensionless work. In Hamil- 
tonian dynamics, for example, this work is w t [A] = 
Pf*dt'(dH/dt'). For convenience, we define the total 
dissipative work as Q,[X] = w T [X] — A/ r . 

We will refer to data sets which only include realiza- 
tions from the forward path-ensemble as 'unidirectional', 
and those with paths from both path-ensembles as 'bidi- 
rectional'. Notably, sampling paths from these conjugate 
ensembles and calculating the associated work ^[X] is 
possible in single-molecule pulling experiments as well as 
computer simulations (cf. Refs. 6 ' 13 ). To combine bidi- 
rectional data to estimate {F) f, we apply the Crooks 
fluctuation theorem 35,36 to Eq. 6 and divide by c/, lead- 
ing to, 



Ft 



E 



{Nf + N r e 



N r 

E 



T[X r 



\ N f + N r e 



(17) 

which is bidirectional path-average estimator of Minh 
and Adib, 28 derived here by a different route which 
demonstrates its optimality. (The asymptotic variance 
estimator for this equation is written in a closed form in 
Appendix B.) In these bidirectional expressions, samples 
drawn from the reverse path-ensemble are time-reversed 
to obtain the paths X rn . The dissipated work estimate, 
Ct[X] = w T [X] — Af T , requires an estimate of A/ r . A 
method for obtaining this estimate will be described next. 



FREE ENERGY 



Jarzynski's equality, 1 ' 2 



-Aft 



(18) 



4 



relates noncquilibrium work and free energy differences. 
To facilitate the use of EBS in Jarzynski's equality, we 
define a path-ensemble by choosing T[X] = e 
Eq. 10, leading to 



-w t [X] 



Qw t [X] 



-w t [X] 



9/W 



dXe.- Wt[x] q f [X}. 



(19) 

When only unidirectional data is available, the optimal 
EBS estimator for Jarzynski's equality is 



-A/ t 



N f 



N f 



-wt[X fn ] 



(20) 



and its asymptotic variance is straightforwardly given by 
error propagation. 38 Estimators 30,31,39 and asymptotic 
variances 39, have also been developed for unidirectional 
importance sampling forms of the equality. 

When bidirectional data is available, the same choice 
of F[X] in Eq. 17 gives the estimator 



-Af t 



-w t [X fr 



E 



-w t [X rn ] 



N f + N r e- fi [ x /«] ^ Nf + N r e 



-ti[x„ 



(21) 



In this equation, choosing t — or t — t leads to an im- 
plicit function mathematically equivalent to the Bennett 
acceptance ratio method, 3,15 as previously explained. 28,41 
The asymptotic variance of Aft is calculated by augment- 
ing the matrices M and and using <j> = ip = Af t = 
— \n(c Wt /cf) in Eq. 9, such that, 



a 2 (Af t ) =Q Wt , 



26 



w t f 



e 



(22) 



VI. POTENTIAL OF MEAN FORCE 



Building on Jarzynski's equality, Hummer and Sz- 
abo developed expressions for the PMF, 5,6 the free en- 
ergy as a function of a reaction coordinate rather than 
a thermodynamic state, that may be used to interpret 
single-molecule pulling experiments. In these experi- 
ments, a molecule is mechanically stretched by a force- 
transducing apparatus, such as an laser optical trap or 
atomic force microscope tip (c.f. 6 ). The Hamiltonian gov- 
erning the time evolution in these experiments, H(x; t) = 
H (x) + V(z(x);t), is assumed to contain both a term 
corresponding to the unperturbed system, Hq(x), and a 
time-dependent (typically harmonic) external bias poten- 
tial imposed by the apparatus, V(z; t), which acts along a 
pulling coordinate, z(x). As the coordinate zt = z(x(t)) 
is observed at fixed intervals At over the course of the 
experiment, we will henceforth use t = 0,1,..., T as an 
integer time index. We calculate the work with a dis- 
crete sum as w t = J2n=i [ v n(z n ) - V n -i(z n )], where 
V n {z) = V{z-nAt). 

While the expressions in Section V provide an esti- 
mate of relative free energies of the equilibrium thermo- 
dynamic states defined by H(x;t), they are not imme- 
diately useful as an estimate for the PMF along z. 5,6,42 



By applying the noncquilibrium estimator for thermody- 
namic expectations, 3,4 it was shown that the PMF in the 
absence of an external potential is given by 5,6 



-ffoO) 



(S(z~z t )e-^) 



w t \ p V(z t ;t) 



f 



(23) 



where the dimensionless PMF, go(z), is defined in rela- 
tion to the normalized density as go(z) = — hipo(z) — Sg. 
In this equation, Sg is a time-independent constant, 
e~ s 9 =Jdx e~ H ^ I J dx e - H o(^e 

This theorem can be used to develop estimators for 
the PMF by replacing the delta function using a kernel 
function of finite width, such as, 



h(z 



zt) = 




< 



(24) 



The width Az must be small so that e v ^ z ' ,t ^ > does not vary 
substantially across it. 

As this theorem is valid at all times, it is possible to 
obtain an asymptotically unbiased density estimate p t 
from each time slice. It is far more efficient, however, to 
estimate the PMF using all recorded time slices. While 
any linear combination of time slices will lead to a valid 
estimate, certain choices will be more statistically effi- 
cient (leading to lower variance) than others. One way 
to combine time slices is to use the asymptotic covari- 
ance matrix in the method of control variates, 20 lead- 
ing to a generalized least-squares optimal estimate of the 
PMF. Unfortunately, we empirically found this approach 
to be numerically unstable. A more numerically stable 
approach, which was proposed by Hummer and Szabo, 5,6 
is based on the weighted histogram analysis method, 43,44 



Po{z) 



J2t^t(z)p t ( 2 



; Mz)^e~ v ^ +A ^. (25) 



While this weighting scheme is optimal, in a minimal- 
variance sense, for independent samples from multiple 
equilibrium distributions, these assumptions do not hold 
for time slices from noncquilibrium trajectories. How- 
ever, Oberhofcr and Dcllago did not observe substantial 
improvement in PMF estimates when using other time- 
slice weighting schemes. 45 

By defining the path-ensemble, 



[X] =6(z-z t )e- w *Wqf[X];c Zt = J dX q Zt [X] (26) 



and making use of Jarzynski's equality (Eq. 18) for e A ? t , 
we can write Hummer and Szabo's PMF estimator as 



-9o(z) 



£ t e-^*) (c f /c wt y 



(27) 



which can be readily analyzed in terms of EBS. While 
Hummer and Szabo proposed using the unidirectional 
path average estimator (Eq. 14) to estimate the expec- 
tations in Eq. 27, Minn and Adib later applied a bidi- 
rectional estimator (Eq. 17), leading to significantly im- 
proved statistical properties. 28 
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The asymptotic variance of these estimators can be 
determined by choosing </> = ?/;= p (z) in Eq. 9. For 
the bidirectional estimator, the matrices M and © will 
contain one column each for the / and r path-ensembles, 
and T+l columns each for the path-ensembles associated 
with {w t }J =0 and {z t }J =0 . The relevant partial deriva- 
tives are, 




where 7, = lnc,, N = J2t( Cz t I ' c w t ) is the numerator of 
Eq. 27, and T> = J^. e~ v{ ~ z '^ (cf/c Wt ) is its denominator. 
These lead to an estimate for a 2 (p (z)). Finally, the 
asymptotic variance in the PMF is given by the error 
propagation formula, a (go(z)) ps a (fto(z))/po(z) . 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 
Time (s) 



FIG. 1: Comparison of estimators for Aft: This figure is 
similar to Fig. 1 of Ref. 28 , except that error bars are now 
included and the sample size is halved. The unidirectional 
estimator (Eq. 20) is applied to 250 forward (rightward tri- 
angles) or reverse (leftward triangles, time reversed) sampled 
paths, and the bidirectional estimator (Eq. 21) to 125 paths 
in each direction (upward triangles). The exact Aft is shown 
as a solid line. Error bars (sometimes smaller than the mark- 
ers) denote one standard deviation of Aft, estimated using 
the expressions presented here. The vertical dashed lines are 
at the times considered in Fig. (4). 



VII. ILLUSTRATIVE EXAMPLE 

We demonstrate these results with Brownian dynam- 
ics simulations on a one-dimensional potential with 
Uq(z) — (5z 3 — lOz + 3)z, which were run as previ- 
ously described. 28 A time-dependent external perturba- 
tion, V(z;t) = k s (z — z(t)) 2 /2, with k s = 15 is applied, 
such that the total potential is U(z; t) — Uo(z) + V(z; t). 
After 100 steps of equilibration at the initial z(t), z(t) 
is linearly moved over 750 steps from —1.5 to 1.5 in for- 
ward processes and 1.5 to —1.5 in the reverse. The po- 
sition at each time step is calculated using the equation 
z t = Zt-i - dU{ 2~ l] DAt + {2DAtf/ 2 R tl where the dif- 
fusion coefficient is D = 1, the time step is At = 0.001, 
and R t ~ N(0, 1) is a random number from the standard 
normal distribution. 

As previously noted, 28,46,47,48 unidirectional sampling 
leads to significant apparent bias in estimates of A/ t (Fig. 
1). In addition to the increased bias as the system is 
driven further from equilibrium, we further observe that 
the estimated variance also increases. Bidirectional sam- 
pling, on the other hand, leads to a significant reduction 
in bias and variance, 28 such that free energy estimate 
is within error bars of the actual free energy. Because 
Aft represents the estimated free energy difference with 
respect to t, the estimated <r 2 (AF t ) increases with t, be- 
coming equal to the well-known Bennett acceptance ratio 
asymptotic variance estimate 15,41 when t = r. 

Similar trends are observed with the Hummcr-Szabo 
PMF estimates (Fig. 2). For unidirectional sampling, 
the finite-sampling bias and estimated variance increases 
when the PMF is far from the region sampled by the 
initial state. With bidirectional sampling, the bias is sig- 
nificantly reduced; the PMF estimate is largely within 
error bars of the actual PMF. 
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FIG. 2: Comparison of PMF estimators: This figure is similar 
to Fig 2 of Ref. 28 , except that error bars are now included and 
the sample size is halved. In the left panel, the unidirectional 
Hummer and Szabo estimator is applied to (a) 250 forward 
(rightward triangles) or 250 reverse (leftward triangles) sam- 
pled paths. In the right panel, the bidirectional estimator is 
applied to 125 sampled paths in each direction (upward trian- 
gles). The exact PMF is shown as a solid line in both panels. 
Error bars (sometimes smaller than the markers) denote one 
standard deviation of Ago (2), estimated using the expressions 
presented here. The vertical dashed lines are at the positions 
considered in Fig. (4). 



To analyze these trends more quantitatively, we re- 
peated the experiment 1000 times. For both Af t 
and go(z), we calculated the bias as B(jFi) = 
sEpif^/," ~ (J 7 )) and the standard deviation as 

a{F f ) = y / ^Ef=i(^'/, s - {F)) 2 , where S = 1000 is the 
number of replicates. The results from these more exten- 
sive simulations support our described trends (Fig. 3). 
For unidirectional sampling, the bias in both Af t and 
go(x) appear to significantly increase around the barrier 
crossing. In the bidirectional free energy estimate, how- 
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Time (s) Position 

FIG. 3: Ratio of estimator bias to standard deviation: This 
ratio is calculated for the (a) free energy and (b) PMF, using 
1000 independent estimates. Each estimate is obtained and 
the type of path sample is indicated as in Figs. (1) and (2). 
The vertical dashed lines are at the times/positions considered 
in Fig. (4). 



ever, the bias is small relative to the variance at all times. 
Notably, in the bidirectional PMF estimate, there is a 
small spike in the bias near the barrier, potentially due 
to reduced sampling in the region. 

While in the large sample limit, the bias in the unidi- 
rectional estimate is expected to be small compared to 
the variance, 47 our distribution of unidirectional e~ A ^ f 
estimates is significantly skewed and does not resemble a 
Gaussian distribution expected by the central limit the- 
orem (data not shown). Hence, the asymptotic limit has 
not been reached and the large relative bias is caused by 
insufficient sampling of rare events with low work values 
that dominate the exponential average. 37 Larger sample 
sizes would be necessary for the distribution of estimates 
to be normally distributed and for the error to be dom- 
inated by the variance (which we estimate here) rather 
than the bias. 

The accuracy of variance estimates may be assessed by 
comparing predicted and observed confidence intervals. 
If the estimates are indeed normally distributed about 
the true value, about 68% of estimates from many inde- 
pendent replicates of the experiment should be within 
one standard deviation of the true value, 95% within 
two, and so forth. Fig. (4) compares confidence inter- 
vals predicted using the described asymptotic variance 
estimators and the actual fraction of estimates within 
the interval. 

We observe that the accuracy of our asymptotic vari- 
ance estimate in characterizing the confidence interval 
largely depends on the presence of bias. In the bidirec- 
tional Aft estimate, where there is little bias, the asymp- 
totic variance estimate works very well. For the uni- 
directional Aft estimates, it works well near the initial 
state but underestimates the error as the system is driven 
further away from equilibrium, concurring with the bias 
trend. In the bidirectional PMF estimate, the asymptotic 
variance estimate accurately describes the confidence in- 
terval except near the barrier, where it slightly underes- 
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FIG. 4: Validation of asymptotic variance estimators: Pre- 
dicted vs. observed fraction of 1000 independent estimates 
that are within an interval of the true value for (a)-(c) Aft 
and (d)-(f) go(z) at the indicated times or positions. Each 
estimate is obtained and the type of path sample is indicated 
as in Figs. (1) and (2). Error bars on these fractions are 95% 
confidence intervals calculated using a Bayesian scheme de- 
scribed in Appendix B of Chodera et. al., 49 except that, 
for numerical reasons, the confidence interval was estimated 
from the variance of the Beta distribution assuming approxi- 
mate normality, rather than from the inverse Beta cumulative 
distribution function. 



timates the uncertainty, probably due to the small spike 
in bias. 

In the regime where the bias is much smaller than the 
variance, B <C <J, the asymptotic variance estimate pro- 
vides a good estimate of the actual statistical error in 
the estimate. This also permits us to model the posterior 
distribution of quantity being estimated as a multivari- 
ate normal distribution with mean T and covariance 0. 
Doing so provides a route to combining estimates from 
independent datasets collected from different path en- 
sembles — such as different pulling speeds or from equi- 
librium and nonequilibrium path ensembles — without 
knowledge of path probability ratios. This is achieved 
by maximizing the product of these posterior distribu- 
tions in a manner similar to the Bayesian approach for 
estimating Af T described in Ref. 17 . 
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APPENDIX A: CLOSED-FORM EXPRESSION FOR THE ASYMPTOTIC VARIANCE, GIVEN 

UNIDIRECTIONAL DATA 

In this appendix, we show that given unidirectional data, the optimal EBS estimate is the sample mean and its 
variance simplifies to the variance of a sample mean. First, consider the application of the optimal EBS estimator, 
Eq. 6, to estimating a noncquilibrium path-ensemble average from a unidirectional data set, 



Nf 

E 



n=l 

N f 



Nf q f [X fr , 



Cf qr f [X fn ] 

Nf 



(Al) 
(A2) 



Dividing both sides by cy, we obtain the sample mean estimator, 



1 iff 

' n=l 



(A3) 



We shall now simplify the asymptotic variance estimate by closely following the procedure of Shirts and Chodera. 27 
When M has full column rank, can be written as (Eq. D7 of Rcf. 27 ), 



= [(M T M)- 1 - N + bl K l K ]- 



(A4) 



where b is an arbitrary multiplicative factor and Ik is a 1 X K matrix of ones. 
The weight matrix M consists of two columns, 



nf 



M nJ r 



\f[Xfn] 



1 



N f cj L qf[X fn ] Nf 



Nfcfq f [Xf n } NfFf ' 



(A5) 
(A6) 



obtained by applying Eqs. 8 and 11. This leads to, 



The matrix M T M has the determinant, 



n: 



7V7 1 Y N ' M 2 



i Nf i 





an 


012 




a.21 


a22 



(A7) 



N f — 

J n — 



n=l 



Nf 



which allows us to write the inverse covariance matrix as, 

1 = 



^-Nf + b ~^ + b 

_ai2 il, 2lLL +b 

D ' D 



By applying the same steps as Appendix E of Shirts and Chodera, 27 we then obtain the determinant 

10-1 = ^, 



(A8) 



(A9) 



(AlO) 
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where a = a\2 = We then obtain the asymptotic covariance estimate, 







D 
lab 



%-b ^-N f + b 



To estimate the variance, we apply Eq. 13, leading to 

a 2 {Tj) « T) (0^ Tf - 20^ / + ©,/) 



/v? iVj 



TV 2 

n=l / 



1 

i 

]v7 



1 Nf 

1 n=l 



Nf 



N f ^ 1 

1 n = l 



1 71=1 



which is the variance of a sample mean estimate. 



(All) 

(A12) 
(A13) 

(A14) 
(A15) 

(A16) 



APPENDIX B: CLOSED-FORM EXPRESSION FOR THE ASYMPTOTIC VARIANCE, GIVEN 

BIDIRECTIONAL DATA 



In this appendix, we obtain a closed-form expression for the asymptotic variance of the optimal EBS estimate for 
Tf, given bidirectional data. We will follow a similar procedure as in Appendix A. For the bidirectional case, the 
weight matrix M consists of three columns, M — [rrif m r mjr f ], where is a column matrix of weights from Eqs. 
8 and 11 corresponding to path-ensemble i. The elements of M are, 



c } l q f [X n ] 



1 



RI, 



NfCj^fiXn} + NrCr^riXn] Nf + N r e~^ X ^ 

c~ 1 g r [A > / „] 1 _ 

Nfc/qflXn] + NrC^qAXn] ~ N f A x »l+N r ~ 

(nm i = (m.) n -k (l ) 

V Tf J N f + N r e-^ \ T S ) f K n >' 



= Nj l e(L n ) 
N- l e{-L n ) 



(Bl) 
(B2) 
(B3) 



where e is defined as the Fermi function, e(L n ) = 1+i }-l„ , and we define L n = W[X n ] — Af t + In (^j/^J- This allows 
us to write M T M as, 



/V 



M T M = 



i (r\x\\ 



e {L n ) 2 N/N ^ ^ ^ 



e(L n )e(-L n ) 



1 ( F\X\ 



e(L n f 



a ff a fr O-fFj 
a J= S J : &rT s a T s J= s 



(B4) 



Using the determinant, 



(B5) 
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we write the inverse covariancc matrix estimator as, 



© 1 = 



f D " -N J + b 

-a^ J ^ f a fr .+a f ^ f a r ^ f 

D T " 

a fr a r ^ f -a f ^ f a rr 

D ' " 



-a F p a.f r +a.fF ,a r T± 



f-Fj 



D 



D 



a fr-CLj-jr ^ —CLfFj CL rr 
D 

a f ^ f a fr -a ff a r ^ f 
4^ 



N r + b ^ f ir D " +b 



D 



By applying the same steps as Appendix E of Shirts and Chodera, 27 we obtain the determinant 

- %(a 2 t - r Nf + a,f r a 

rr Nf) 

1 D 

Applying Eq. 13 to and simplifying, it can be shown that the variance estimate is, 

* 2 (T 2 f ) « jf.Hr.r. 2(-)r. , ■ Q ff ) 

af r (a,f r N f + a rr N r ) 

I 



(B6) 
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